Adaptation in Machine Translation
Abstract
Machine translation remains one of the grand challenges of natural language processing. Recent advances in the field have led to a number of applications that demonstrate the potential and impact of the technology. Statistical machine translation (SMT) has emerged as the most promising current approach to the translation problem. Over the last decade, it has come to solidly outperform rule-based methods on many tasks and in many evaluations. The main advantage of the statistical approach is that expert knowledge no longer has to be encoded manually, rule by rule; instead, translation models are learned automatically from parallel data consisting of translated texts. This leads to faster and cheaper development of new translation systems.
One limitation to date, however, is that the quality of an SMT system depends strongly on the similarity between its training data and the data it is deployed on. When a machine translation system is applied to a new task, performance typically drops significantly. And since huge amounts of data are needed for training, it is not possible to collect only matching data for every new application. This thesis is devoted to adapting MT systems in the scenario of mismatching training data. We develop different approaches to increase translation performance even though all or some of the training data does not match the system's ultimate application. To improve translation quality when applying a translation system to a new task, we explore four approaches: integrating mismatching data, combining matching and mismatching data, adapting the system to very specific topics, and exploiting data that matches in genre only. We present techniques and experimental systems that improve the translation quality for a particular type of given data.
We show that the context available during the translation process is shorter when using mismatching data. To address this problem, we develop a bilingual language model that increases the context available during decoding. With this model, we improve the system's ability to exploit mismatching data and show that this results in improved translation quality.
Our training data is typically derived from varied sources encompassing different topics and genres. Consequently, some parts of the data might match the task better than others. In response, we weight the different parts of the training data so that the influence of the matching data can be increased. By avoiding a binary decision on whether the data matches or not, we can make better use of corpora that match our target task only to a certain degree. We show significant improvements by combining models trained on in-domain and out-of-domain data.
To enable the translation of topic-specific terms, data that matches the topic is needed. For most applications, however, it is difficult to obtain data that matches both in topic and in genre. Therefore, we present an approach to include data in the translation system that matches only the topic of the input data. We use the titles of Wikipedia articles to translate topic-specific terms. Since this type of data does not contain all possible word forms, we also develop techniques to find translations for morphological variations of the same word.
Another problem addressed in this thesis is the adaptation of a translation system to the genre of an application. To enable better exploitation of data that matches in genre but not in topic, we present a continuous space language model. We show that this model generalizes better than an n-gram language model when topic-specific words occur. We perform a detailed analysis of the impact of all these approaches on the task of translating different types of data and show their positive influence in systems submitted to international evaluation campaigns.
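The idea of weighting matching and mismatching data instead of making a binary choice can be illustrated with a small sketch. The abstract does not specify how the thesis combines its models, so the example below assumes plain linear interpolation of two toy unigram models; the function names, the sample sentences, and the weight of 0.7 are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram probabilities from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(in_domain, out_of_domain, weight):
    """Linear interpolation: weight * P_in(w) + (1 - weight) * P_out(w).

    This is an assumed combination scheme, not necessarily the one
    used in the thesis.
    """
    vocab = set(in_domain) | set(out_of_domain)
    return {
        w: weight * in_domain.get(w, 0.0)
        + (1.0 - weight) * out_of_domain.get(w, 0.0)
        for w in vocab
    }


# Toy corpora (hypothetical): a small matching corpus and a mismatching one.
in_model = unigram_model("the lecture covers neural networks".split())
out_model = unigram_model("the parliament adopted the resolution".split())

# A weight above 0.5 increases the influence of the matching data.
mixed = interpolate(in_model, out_model, weight=0.7)
```

In practice the weight would be tuned on held-out data from the target task rather than fixed by hand, and the same interpolation idea applies to larger n-gram or translation models.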
Similar resources
Domain specialization: a post-training domain adaptation for Neural Machine Translation
Domain adaptation is a key feature in Machine Translation. It generally encompasses terminology, domain and style adaptation, especially for human post-editing workflows in Computer Assisted Translation (CAT). With Neural Machine Translation (NMT), we introduce a new notion of domain adaptation that we call “specialization” and which is showing promising results both in the learning speed and in...
A Comparative Study of English-Persian Translation of Neural Google Translation
Many studies abroad have focused on neural machine translation and almost all concluded that this method was much closer to humanistic translation than machine translation. Therefore, this paper aimed at investigating whether neural machine translation was more acceptable in English-Persian translation in comparison with machine translation. Hence, two types of text were chosen to be translated...
Context Adaptation in Statistical Machine Translation Using Models with Exponentially Decaying Cache
We report results from a domain adaptation task for statistical machine translation (SMT) using cache-based adaptive language and translation models. We apply an exponential decay factor and integrate the cache models in a standard phrase-based SMT decoder. Without the need for any domain-specific resources we obtain a 2.6% relative improvement on average in BLEU scores using our dynamic adaptat...
A Systematic Adaptation Scheme for English-Hindi Example-Based Machine Translation
The success of Example-Based Machine Translation (EBMT) often depends upon how efficient the adaptation scheme is. Adaptation primarily aims at modifying retrieved examples to meet the required demands of a given translation task. The present work looks at adaptation for EBMT from English to Hindi. This paper describes a rule-driven adaptation scheme for modifying a retrieved translation exampl...
Unsupervised Adaptation for Statistical Machine Translation
In this work, we tackle the problem of domain adaptation of language and translation models without explicit bilingual in-domain training data. In such a scenario, the only information about the domain can be induced from the source-language test corpus. We explore unsupervised adaptation, where the source-language test corpus is combined with the corresponding hypotheses generated by the translatio...
Online adaptation strategies for statistical machine translation in post-editing scenarios
One of the most promising approaches to machine translation consists in formulating the problem by means of a pattern recognition approach. By doing so, there are some tasks in which online adaptation is needed in order to adapt the system to changing scenarios. In the present work, we perform an exhaustive comparison of four online learning algorithms when combined with two adaptation strategi...